Medical image fusion with deep neural networks

Medical image fusion aims to fuse multiple images from one or more imaging modalities to support clinical diagnosis and evaluation, and it has attracted increasing attention. However, most recent medical image fusion methods require prior knowledge, making it difficult to select image features. In this paper, we propose a novel deep medical image fusion method based on a deep convolutional neural network (DCNN) for directly learning image features from original images. Specifically, source images are first decomposed by low rank representation to obtain the principal and salient components, respectively. Following that, the deep features are extracted from the decomposed principal components via DCNN and fused by a weighted-average rule. Then, considering the complementarity between the salient components obtained by the low rank representation, a simple yet effective sum rule is designed to fuse the salient components. Finally, the fused result is obtained by reconstructing the principal and salient components. The experimental results demonstrate that the proposed method outperforms several state-of-the-art medical image fusion approaches in terms of both objective indices and visual quality.

shearlet filters, which makes full use of the limited redundancy of the DTCWT and the directional selectivity of the shearlet filter. Besides, sparse representation (SR) methods have been widely studied in the image fusion field [28]. For example, Yang et al. first applied SR to image fusion and obtained satisfactory performance [28]. Liu et al. proposed a general fusion framework that combines SR and multi-scale transform (MST) to improve the details of the fusion result [29]. Liu et al. introduced convolutional sparse representation (CSR) into image fusion to overcome the limited detail preservation of the SR method [30]. These publications demonstrate that SR is able to achieve image fusion.
In recent years, deep learning has become a hot topic in the image processing field. Several deep models have been applied to image fusion [31][32][33][34]. For instance, in Liu's paper [31], the multi-focus image fusion problem is regarded as binary image classification, and a convolutional neural network (CNN) is used to learn the weight map from a large number of labeled training samples. Li et al. proposed a novel deep learning architecture for infrared and visible image fusion that combines convolutional layers, a fusion layer and a dense block [32]. These methods demonstrate that deep learning models are an effective tool for image fusion.
However, although traditional methods can achieve medical image fusion, how to utilize CNNs effectively for medical image fusion is still an open question. In this paper, a novel medical image fusion method based on a pre-trained CNN model, called deep medical image fusion (DMIF), is proposed. First, the source images are decomposed into principal and salient components via low rank representation. Second, the pre-trained CNN model is utilized to extract the deep features of the principal components, and a weighted-average strategy is adopted to fuse these features. Third, the salient components are fused with a sum strategy. Finally, the fused principal and salient components are combined to reconstruct the fusion result. Experimental results demonstrate that the proposed method performs favorably against several state-of-the-art approaches.
The main contributions of this work are as follows: (1) A novel medical image fusion approach based on CNN is proposed. The proposed method opens new opportunities for medical image fusion. (2) For the first time, low rank representation is adopted to decompose medical images into two components, i.e., principal and salient layers, which effectively separates structure information from detail information. (3) Experiments on 44 pairs of medical images verify that the proposed method outperforms the other approaches.
The remainder of this work is organized as follows. Section "Proposed method" introduces the proposed method in detail. Section "Experiments" presents the experimental results and analyses. Finally, Section "Conclusions" gives the conclusions of this work.

Related work
VGG-16 [35] is a convolutional neural network (CNN) architecture designed for image classification. It was proposed by the Visual Geometry Group at the University of Oxford and was one of the top-performing models in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2014. The architecture is characterized by its depth: it consists of 16 weight layers, hence the name VGG-16, comprising 13 convolutional layers with 3 × 3 kernels and a stride of 1, and 3 fully connected layers. The convolutional layers are grouped into five blocks, each followed by a max-pooling layer with a 2 × 2 filter and a stride of 2 to reduce the spatial dimensions. After each pooling layer, the size of the feature map is halved. For a 224 × 224 input, the last feature map before the fully connected layers is 7 × 7 with 512 channels. Despite its simple and straightforward architecture, VGG-16 has demonstrated strong performance on various computer vision tasks and has served as a foundation for more advanced architectures. Owing to its strong feature representation ability, VGG-16 is used as the feature extractor in this work.
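The feature-map sizes quoted above can be checked with a few lines of arithmetic. The sketch below assumes the standard VGG-16 configuration (3 × 3 convolutions with stride 1 and padding 1, 2 × 2 max pooling with stride 2, and five blocks of 2, 2, 3, 3 and 3 convolutional layers); the function names are hypothetical and are not taken from the paper.

```python
# Standard VGG-16 layout: (number of 3x3 conv layers, output channels) per block.
VGG16_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]


def conv_out(size, kernel=3, stride=1, pad=1):
    """Spatial size after a convolution (3x3, stride 1, padding 1 keeps the size)."""
    return (size + 2 * pad - kernel) // stride + 1


def pool_out(size, kernel=2, stride=2):
    """Spatial size after max pooling (2x2, stride 2 halves the size)."""
    return (size - kernel) // stride + 1


def vgg16_feature_shapes(input_size=224):
    """Return (height, width, channels) after each of the five pooling layers."""
    shapes = []
    size = input_size
    for n_convs, channels in VGG16_BLOCKS:
        for _ in range(n_convs):
            size = conv_out(size)
        size = pool_out(size)
        shapes.append((size, size, channels))
    return shapes


for shape in vgg16_feature_shapes(224):
    print(shape)
# (112, 112, 64)
# (56, 56, 128)
# (28, 28, 256)
# (14, 14, 512)
# (7, 7, 512)
```

The five blocks contain 13 convolutional layers in total, and only the pooling layers change the spatial size, which is why each block halves the feature map.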

Proposed method
Figure 1 shows the flow chart of the proposed fusion framework, which consists of three steps. First, the original images are decomposed into principal and salient components. Second, two different fusion rules are applied to fuse the two kinds of components, respectively. Finally, the fused principal and salient components are combined to reconstruct the result. The details of our method are as follows:

Low rank representation-based decomposition
Low rank representation (LRR) seeks the lowest-rank representation among all the candidates that represent all vectors as linear combinations of the bases in a dictionary. Unlike the well-known sparse representation, which computes the sparsest representation of each data vector individually, LRR aims at finding the lowest-rank representation of a collection of vectors jointly. Liu et al. [36] first proposed low rank representation for data segmentation. However, LRR fails to preserve local structure information. Thus, latent low-rank representation (LatLRR) was proposed, which can extract both the global structure and local details. LatLRR is formulated as follows:

$$\min_{Z, L, E} \|Z\|_* + \|L\|_* + \lambda \|E\|_1 \quad \text{s.t.} \quad X = XZ + LX + E, \tag{1}$$

where $\lambda$ is a free parameter, $\|\cdot\|_*$ is the nuclear norm, i.e., the sum of the singular values of a matrix, and $\|\cdot\|_1$ denotes the $l_1$-norm. $X$ is the original image, $Z$ is the low-rank coefficient matrix, $L$ is the saliency coefficient matrix, and $E$ is the sparse noise term. $XZ$ denotes the low-rank part, and $LX$ the saliency part.
Since medical images contain rich high-frequency information, low rank representation is adopted in this work. The two source images, denoted as $I_1$ and $I_2$, are decomposed by LatLRR to obtain the low-rank part $I_i^r$ and the saliency part $I_i^s$, $i \in \{1, 2\}$. The main aim of this step is to obtain the main structure information and the spatial details, which is beneficial for preserving the spatial information of the source images.
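To make the roles of the terms in the LatLRR model concrete, the following minimal sketch evaluates the objective (nuclear norms of Z and L plus a weighted l1-norm of E) and splits an image matrix into its low-rank part XZ, saliency part LX and residual E. It is not a LatLRR solver: Z and L are assumed to come from an optimizer that is not shown here, and the function names and the value of `lam` are hypothetical.

```python
import numpy as np


def nuclear_norm(M):
    """Sum of the singular values of M."""
    return float(np.linalg.svd(M, compute_uv=False).sum())


def l1_norm(M):
    """Sum of the absolute values of the entries of M."""
    return float(np.abs(M).sum())


def latlrr_objective(Z, L, E, lam):
    """Objective value ||Z||_* + ||L||_* + lam * ||E||_1."""
    return nuclear_norm(Z) + nuclear_norm(L) + lam * l1_norm(E)


def split_parts(X, Z, L):
    """Return the low-rank part XZ, the saliency part LX and the residual E."""
    low_rank = X @ Z
    salient = L @ X
    noise = X - low_rank - salient  # enforces X = XZ + LX + E
    return low_rank, salient, noise


# Toy example with placeholder coefficients (not the output of a real solver).
rng = np.random.default_rng(0)
X = rng.random((4, 6))   # image matrix, 4 rows x 6 columns
Z = np.eye(6)            # low-rank coefficients act on the right of X
L = np.zeros((4, 4))     # saliency coefficients act on the left of X
low_rank, salient, noise = split_parts(X, Z, L)
print(latlrr_objective(Z, L, noise, lam=0.8))
```

Note the shapes: for an image X of size m × n, Z is n × n and L is m × m, so XZ and LX both have the size of X.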

Fusion of low-rank parts
The low-rank part mainly reflects the global structure and brightness information of the original images. In order to effectively fuse the low-rank parts of the original images, a CNN model is adopted to extract deep features. Specifically, the deep features of the low-rank parts are first extracted via the pre-trained CNN, i.e., VGG-16. The CNN model can be thought of as a composition of a number of functions:

$$f(X) = f_L(\cdots f_2(f_1(X_1; \omega_1); \omega_2) \cdots; \omega_L), \tag{2}$$

where each function $f_l$ takes the data samples $X_l$ and a filter $\omega_l$ as inputs and outputs $X_{l+1}$, $l \in \{1, 2, \ldots, L\}$, and $L$ is the number of layers.
For the pre-trained CNN model, the filter banks $\omega_l$ have been learned from a large dataset, e.g., ImageNet. Given the input image $I_i^r$, multi-layer features are extracted. In our work, three convolutional layers, i.e., 'conv1', 'conv2', and 'conv3', are adopted. Then, a block-based average strategy is used to evaluate the weight of each feature.
Next, the initial weights are calculated by a soft-max operator:

$$\omega_k(x, y) = \frac{\hat{\omega}_k(x, y)}{\sum_{n=1}^{N} \hat{\omega}_n(x, y)}, \tag{5}$$

where $\hat{\omega}_k$ denotes the block-averaged weight obtained in the previous step. The pooling operator in the VGG-16 model is a subsampling method, and thus it decreases the spatial size of the feature maps to $1/s$ of the input, where $s = 2$ is the stride of the pooling layer. Thus, at the $i$-th of these layers, the size of the feature maps is $1/2^{i-1}$ of that of the input image. After the initial weight map is obtained, an upsampling operator is utilized to increase the spatial size of the weights to that of the source images.
Finally, the multi-layer features are merged together to obtain the fused low-rank part.
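The weighting steps for a single layer (block-based averaging, normalization, upsampling, weighted average) can be sketched as follows. Several details are assumptions rather than statements from the paper: the activity of a feature map is taken as the l1-norm over its channels, the block radius `r` is a free choice, upsampling is nearest-neighbour, and the feature arrays are plain NumPy stand-ins for VGG-16 outputs. All function names are hypothetical, and the merging of the three layers is not covered.

```python
import numpy as np


def block_average(a, r=1):
    """Mean over a (2r+1) x (2r+1) window, replicating edge values at the borders."""
    padded = np.pad(a, r, mode="edge")
    h, w = a.shape
    out = np.zeros((h, w), dtype=float)
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (2 * r + 1) ** 2


def fuse_low_rank(parts, features, scale, r=1, eps=1e-12):
    """Weighted-average fusion of low-rank parts for one feature layer.

    parts:    list of (H, W) low-rank images.
    features: list of (C, H // scale, W // scale) feature maps, one per image.
    scale:    spatial downsampling factor of this layer (e.g. 1, 2 or 4).
    """
    # Assumed activity measure: l1-norm across channels, then block averaging.
    activity = [block_average(np.abs(f).sum(axis=0), r) for f in features]
    total = sum(activity) + eps  # eps avoids division by zero
    fused = np.zeros_like(parts[0], dtype=float)
    for part, act in zip(parts, activity):
        weight = act / total                                # weights sum to 1 per pixel
        weight_up = np.kron(weight, np.ones((scale, scale)))  # nearest-neighbour upsampling
        fused += weight_up * part
    return fused


# Toy example: two 8x8 low-rank parts and features at half resolution.
rng = np.random.default_rng(0)
parts = [rng.random((8, 8)), rng.random((8, 8))]
features = [rng.random((3, 4, 4)), rng.random((3, 4, 4))]
fused = fuse_low_rank(parts, features, scale=2)
print(fused.shape)  # (8, 8)
```

Because the normalized weights sum to one at every pixel, each fused value lies between the corresponding values of the two low-rank parts.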

Fusion of salient components
The saliency parts preserve the local detail information. Since the saliency features of the source images carry strongly complementary information, a sum rule is adopted to fuse the saliency parts so as to preserve more details.

Image reconstruction
Once the fused low-rank part $G$ and the fused saliency part $H$ are obtained, the fusion result is reconstructed as follows:

$$R = G + H, \tag{6}$$

where $R$ is the resulting image.
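The sum rule for the saliency parts and the final reconstruction are both simple pixel-wise operations. The sketch below assumes that the reconstruction is the pixel-wise sum of the fused low-rank part G and the fused saliency part H; the optional clipping to a valid intensity range is an added convenience, not something stated in the paper, and the function names are hypothetical.

```python
import numpy as np


def fuse_salient(salient_parts):
    """Sum rule: add the saliency parts of all source images pixel by pixel."""
    return np.sum(np.stack(salient_parts, axis=0), axis=0)


def reconstruct(G, H, value_range=None):
    """Combine the fused low-rank part G and fused saliency part H.

    value_range: optional (low, high) tuple used to clip the result
                 (an assumption added here for display purposes).
    """
    R = G + H
    if value_range is not None:
        R = np.clip(R, value_range[0], value_range[1])
    return R


# Toy example with 2x2 "images".
s1 = np.array([[1.0, 0.0], [0.0, 2.0]])
s2 = np.array([[0.0, 3.0], [0.0, 1.0]])
G = np.full((2, 2), 100.0)
H = fuse_salient([s1, s2])
R = reconstruct(G, H, value_range=(0.0, 255.0))
print(R)
```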

Experiments
In order to verify the superiority of the proposed method, several state-of-the-art medical image fusion methods are adopted for comparison, including convolutional neural networks and non-subsampled contourlet transform (CNN-NSCT) [37], the nonsubsampled shearlet transform fusion method (NSST) [38], the nonsubsampled shearlet transform based structure tensor method (NSST-ST) [39], sparse representation (SR) [28], the multi-scale transform based domain sparse representation fusion method (MST-SR) [29], the guided filtering based fusion method (GFF) [22], and discrete stationary wavelet transform with an enhanced radial basis function neural network (DSWT-RBFN) [40]. For the MST-SR method, the Laplacian pyramid is used as the multi-scale transform. In addition, the parameters of all compared methods are set to the default values given by the authors in the corresponding publications.

Test images
In our experiment, 44 pairs of multi-modal medical images, i.e., CT and MRI, MRI and PET, MRI and SPECT, and MR-T1 and MR-T2, are adopted as experimental datasets. Figure 2 shows the original images, which have been accurately registered before fusion. The size of each source image is 256 × 256 pixels.

Objective indexes
In order to quantitatively assess the fusion performance of different approaches, several widely used objective quality indexes are adopted in our experiment, including standard deviation (SD) [41], entropy (EN) [41], normalized mutual information $Q_{MI}$ [42], and phase congruency $Q_{PC}$ [41]. SD measures the overall contrast of the resulting image. EN measures the amount of information in the fused result. $Q_{MI}$ measures how much information from the source images is transferred into the resulting image. $Q_{PC}$ reflects the image details in the fused result.
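The two simplest indexes, SD and EN, can be computed directly from the pixel values. The sketch below assumes 8-bit grayscale images and uses the common definitions (population standard deviation of the intensities, and Shannon entropy in bits of the 256-bin intensity histogram); the exact variants used in the cited references may differ, and the function names are hypothetical.

```python
import numpy as np


def standard_deviation(img):
    """Population standard deviation of the pixel intensities (contrast measure)."""
    return float(np.std(np.asarray(img, dtype=float)))


def entropy(img, levels=256):
    """Shannon entropy (in bits) of the intensity histogram of an 8-bit image."""
    pixels = np.asarray(img).astype(np.uint8).ravel()
    hist = np.bincount(pixels, minlength=levels)
    p = hist / hist.sum()
    p = p[p > 0]  # empty bins contribute nothing to the entropy
    return float(-(p * np.log2(p)).sum())


# Toy example: half of the pixels are black, half are white.
img = np.zeros((4, 4), dtype=np.uint8)
img[:, 2:] = 255
print(standard_deviation(img))  # 127.5
print(entropy(img))             # 1.0
```

A constant image gives SD = 0 and EN = 0, while an image that uses all 256 gray levels equally often reaches the maximum EN of 8 bits.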

CT and MRI image fusion
The first experiment fuses CT and MRI images. Figure 3 presents the fusion images obtained by different methods. As shown in Fig. 3, the CNN-NSCT method leads to some loss of local details. The NSST method yields a low-contrast fusion result. The NSST-ST method fails to fuse the information of the source CT image well. The fusion result obtained by the SR method exhibits distortion. The MST-SR method suffers from loss of detail from the source MRI image. The GFF method decreases the brightness of the source MRI image. Although the DSWT-RBFN method retains the information of the original images well, the boundaries in the fused image are blurry. In comparison, the proposed method retains the detail information of the original images, and the boundaries in the fused result are clearly visible.
To objectively illustrate the fusion quality achieved by different approaches, Table 1 shows the average objective metrics on 11 pairs of CT and MRI images. It can be seen from Table 1 that the proposed approach yields the highest objective indexes in terms of SD, EN, $Q_{MI}$, and $Q_{PC}$, which further shows that the proposed method fuses CT and MRI images more effectively than the other methods. In terms of computational time, the running time of the proposed method is acceptable among all studied techniques. Moreover, the proposed method runs faster than CNN-NSCT.

MRI and PET image fusion
The second experiment is performed on the MRI and PET images. The fusion images yielded by different methods are shown in Fig. 4. In this example, the CNN-NSCT and NSST methods produce unsatisfactory results caused by loss of energy. The NSST-ST method fails to preserve the color information of the source PET image well. The SR and GFF methods lead to severe color distortion. The resulting image of the DSWT-RBFN method cannot retain the color information of the original PET image well. In contrast, the proposed method achieves better visual quality than the other studied methods in terms of both color preservation and detail extraction.
The average objective indexes of different approaches on 11 pairs of MRI and PET images are presented in Table 2. We can see that the proposed method still produces the best performance among the compared methods, which further verifies the advantage of the proposed method. Regarding the running time of all methods, it can be seen that the GFF method is efficient since it only requires image decomposition and fusion without involving any deep features. The computing time of the proposed method is moderate among all approaches.

MRI and SPECT image fusion
The third experiment is conducted on MRI and SPECT images. Figure 5 presents the fused images of different approaches. It is easy to observe that the proposed method still retains the energy and details of the source images, whereas the compared methods cannot preserve color fidelity. Besides, the average objective indexes of different methods on 11 pairs of MRI and SPECT images are given in Table 3. It can be clearly observed that our method outperforms the other studied methods on all the metrics, which is consistent with the visual results.

MR-T1 and MR-T2 image fusion
The fourth experiment is conducted on MR-T1 and MR-T2 images. The fused results of the different fusion approaches are presented in Fig. 6. The NSST-PCNN, SR, and MST-SR methods suffer from detail loss in the fused results. The NSST and DSWT-RBFN methods fail to inject the MR-T2 image well into the fused images. The GFF method cannot integrate the details from the MR-T1 image well. By contrast, the proposed method effectively merges the MR-T1 and MR-T2 images. Furthermore, Table 4 lists the objective metrics of all approaches on 11 pairs of MR-T1 and MR-T2 images. The proposed method still obtains the highest scores on all four indexes, which further illustrates the effectiveness of the proposed method.
..., CNN-based fusion in NSST domain (NSST-CNN) [44], asymmetric dual deep network with sharing mechanism (ADDNS) [45], perceptual high frequency CNN (PHF-CNN) [37], multiscale double-branch residual attention network (MSDRA) [46], are adopted for comparison. An experiment is performed on 11 pairs of CT and MRI medical images. Table 5 presents the objective results of all compared methods. It can be observed that the proposed method still yields the highest fusion performance among all fusion techniques. This experiment further verifies the effectiveness of the proposed method.

Conclusions
In this work, a novel deep medical image fusion method based on a deep convolutional neural network (DCNN) is proposed for directly learning image features from original images. Specifically, source images are first decomposed by low rank representation to obtain the principal and salient components, respectively. Following that, the deep features are extracted from the decomposed principal components via DCNN and fused by a weighted-average rule. Then, considering the complementarity between the salient components obtained by the low rank representation, a simple yet effective sum rule is designed to fuse the salient components. Finally, the fused result is obtained by reconstructing the principal and salient components. Experimental results verify that the proposed fusion method outperforms several state-of-the-art approaches. In future work, the proposed method will be extended to multi-sensor image fusion.
https://doi.org/10.1038/s41598-024-58665-9

Figure 1. Schematic of the proposed medical fusion method. The principal components record the global structure information and brightness information of the original images. The salient components record the local detail information.

Table 1. Average objective indexes of different methods on 11 pairs of CT and MRI images. Significant values are in bold.

Table 2. Average objective indexes of different methods on 11 pairs of MRI and PET images. Significant values are in bold.

Table 3. Average objective indexes of different methods on 11 pairs of MRI and SPECT images. Significant values are in bold.

Table 4. Average objective indexes of different methods on 11 pairs of MR-T1 and MR-T2 images. Significant values are in bold.

Table 5. Average objective results of different CNN-based fusion methods on 11 pairs of CT and MRI images.